Student team: NO
Tool(s):
We used Microsoft SQL server and a java
program developed using Prefuse library in July 2008 for this challenge. In our
approach, we first organize and load the raw data into a relational database.
User can retrieve subset of data of interest by submitting customized SQL
queries. Our tool can then be used to visualize the returned result sets.
Combined with SQL and database engine, users can easily manipulate the data stored
in database system for preprocessing and analysis. Here, we implemented this
tool for this particular challenge. Our implementation and analysis process
includes both Java program for visualization as well as a set of SQL
statements. We used Microsoft SQL server as underlying database management
system for managing the phone records.
Tow page summary: No
Answers:
Phone-1: What is the Catalano/Vidro social network, as reflected in the cell phone call data, at the end of the time period?
Phone-2
Characterize the changes in the Catalano/Vidro social structure over the ten
day period.
Detailed Answers:
Identify
the phone IDs directly associated with members of Catalan/Vidro families and
create the network graph.
We start with loading the phone record into a relational table as following:
CREATE TABLE [Calls]([From]
[int] NULL, [To] [int] NULL,
[Datetime]
[nvarchar](50) NULL,[Duration(seconds)] [int] NULL,
[
[minute]
[int] NULL)
Given caller ID, i.e. @cId, we can uses following queries to gather all callers have been communicating with this caller together with number of calls, number of days of calling and duration of all call.
with
unique_callers
as
(select [from] as
callers from calls where [to]=@cId
union
select [to] as callers from calls
where [from]=@cId)
select @cid as caller1, u.callers as caller2
,count(*) as total_calls
, count(distinct [day]) as call_days
, sum([duration(seconds)]) as call_time
from calls c, unique_callers u
where
(c.[from]=@cID and c.[to]=u.[callers])
or (c.[to]=@cID and c.[from]= u.[callers])
group by u.callers
order by call_days desc, total_calls desc, call_time desc
Given @cID as 200, we can get following table using above query.
Caller 1 |
Caller 2 |
Total_calls |
Call_days |
Call_time |
200 |
5 |
14 |
7 |
15183 |
200 |
1 |
9 |
7 |
8898 |
200 |
2 |
8 |
6 |
8143 |
200 |
3 |
8 |
5 |
8520 |
200 |
137 |
5 |
2 |
5418 |
200 |
97 |
3 |
2 |
3681 |
Table-1 unique callers of caller 200 over 10 days.
Caller 200 had most incoming/outgoing calls, most days, and longest durations with caller 5. Based on the hint that “We believe Ferdinando would call brother Estaban most frequently”, we think caller number 5 is Estaban Catalano. Next, we would like to identify who is likely to be David Vidro. Given the hint that “We also believe that David Vidro coordinates high-level Paraiso activities and communications.“, Although it is possible to compute the connectivity and influence through SQL statements, we find out our visualization can actually gives an easy answer. Our visualization tools treated the SQL results as a relational table, by specify the source column and target column explicitly, we can create a graph of the returned table. Therefore, we can easily construct a network graph based on the phone call activities over 10 day period by visualizing the entire call table as graph (Figure-1).
Figure-1 connection graph based on 10-day period of phone call activities.
Our visualization tool allows us to move nodes around and highlight a set of nodes of interests by clicking on them. When mouse over a particular node, it can also shows all neighboring nodes it can reaches. Figure-1 shows an example of mouse over caller number 5.
Upon constructing the network graph for entire period, we first fixed caller 1, 2, 3, 5, 97, 137 and 200. We then immediately recognize additional caller id with high connectivity. Those nodes include caller 0, 306, 309, 360. Furthermore, we can move callers with similar neighbor together. We observed following:
1) Caller 1 and node 5 seems to be the two nodes with largest connectivity.
2) There is little overlap between the nodes directly connected with caller 1 and caller 5.
3) The following pairs of callers seems connect to same set of callers: caller 1 and caller 309, caller 5 and caller 306, caller 3 and caller 360, caller 2 and caller 0.
We then try regroup the network graph node by drag high connectivity nodes with similar neighbor nodes together. Observing both caller 2 and caller 137 are two nodes that may have shortest way to reach caller 1/309, caller 5/306, caller 2/0 and caller 3/360, we believe they are two most likely candidates for David Vidro. Considering David most involved in coordinating “high level” activities, we think the direct connectivity of David should be small. Therefore, we conclude that caller 137 to be David Vidro.
Also based on the observation of the graph, the people directly associated with caller 2, 3 and 5 is more close to the people directly associated with caller 1 and caller 97. We think caller 2 and caller 3 to be Juan Vidro and Jorge Vidro. However we cannot tell which one is Juan and which one is Jorge.
Therefore, we think the identifiable member of Catalano/Vidro family is
200: Ferdinando Catalano
5: Estaban Catalano
137:
David Vidro
2/3:
Juan Vidro/Jorge Vidro
Observation on the network graph over the ten day periods to note
the changes in the Catalano/Vidro social structures.
We then observe the changes of Catalano/Vidro family social structure over ten day period by following the above callers.
Our approach used here is similar to what we used to identify members of Catalano/Vidro family members. We consider changes on daily basis by constructing a daily connectivity graph of nodes that closely associated with Catalano/Vidro families. The dataset used to generate the network graph can easily generated with following query:
Select
* from calls where [day] = d;
Here the d is the date of day of interests. Although, we draw our conclusion on daily basis, we can also easily generate a snap shot of network graph with any interval. During our study, we also included observation of caller 1 and caller 97 as they are also likely members of unidentifiable member of Catalano/Vidro families. Figure-2 shows graph on 0601 as an example
Figure-2
Network structure snapshot shows caller 1, 2, 3, 5, 97, 137 and 200 with their direct associates.
While observed social network snapshot for each day, we noticed following:
The social structure of Caller 137 has been relative stable throughout the 10-days period. This fits our hypothesis that he is David Vidro because we expect the person who coordinates high level activities should be in a stable situation.
There is no phone call activity for caller 200 on Jun 8, caller 3 and 5 on Jun 8 and 9, and caller 2 on Jun 9. The phone call activities resume on Jun 10 for caller 2, 3 and 5. However, they only make outgoing calls and no incoming calls. This is a very unusual as anyone of caller 2, 3 and 5 have high connectivity before the “no activity”. During June 1st to June 7/8, each one can get between 20 and 50 incoming calls and only makes several outgoing calls.
We formulate following hypothesis that caller 2, 3 and 5 is intended to leave country, probably in an illegal way! We speculate that the plan is to have caller 200 to help caller 3 and 5 preparing on June 8. And caller 2 joined them on June 9. Unfortunately, the plan failed or unexpected terminated so they come back to island before expected. Our hypothesis is supported by following evidence:
1) Both the number of calls and the duration of total phone calls, each of three callers get on the day before they “leave” are relative high based on their daily activities. We suspect a lot of close friends and relatives called in to say farewell. They must also inform them their plan. This can explain why they do not get phone calls on Jun 10 after they are get back on island.
2) We suspect that caller 200 also knew their plan and were with them on Jun 8 to help them out. We observed that caller 137 called caller 200 on Jun 9 at early morning. This is unusual because caller 137 did not call caller 200 often before. We suspect that caller 137 knew the plan and want to check status.
3) All of the caller 200, 2, 3 and 5 are evidently in port cities as indicated by cell phone tower where they originated calls.